Activity correlation based optimal clustering for clock gating for ultra-low power VLSI

ABSTRACT

A clustering bus-specific clock gating method is described to reduce the dynamic power consumed by redundant clock ticks in gate-level. The method exploits correlations between flip-flops for clock gating. An activity correlation matrix is introduced to describe the correlations between the flip-flops. Based on activity correlation information, the flip-flops are classified into several clusters. A payoff function is also described to find an optimal classification scheme. Based on the classification strategy, flip-flop clusters that are less active and more correlated will be gated.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Patent Application, Ser. No. 62/038,022, filed on 15 Aug. 2014. The co-pending Provisional Patent Application is hereby incorporated by reference herein in its entirety and is made a part hereof, including but not limited to those portions which specifically appear hereinafter.

BACKGROUND OF THE INVENTION

This invention relates generally to reducing power consumption of integrated circuits, and, more particularly, to clock gating for reducing the dynamic power consumption of very large scale integrated (VLSI) circuits.

Advances in CMOS technology have enabled higher integration and higher operational frequencies in present VLSI design. Early VLSI designers were concerned with area and speed more than with power consumption. In recent years, however, the popularity of portable devices, mostly powered by batteries, has made power dissipation a factor comparable to area and speed.

One of the largest dynamic power consuming components of a synchronous circuit is the clock distribution network, which is typically responsible for 30%-40% of the dynamic power dissipation. Two factors generally account for this phenomenon: 1) the clock signal has a toggle rate of 1, which is the maximal value; and 2) the clock network drives a large number of cells, including buffers, flip-flops, etc. This large number of fan-out cells makes the load capacitance of the clock distribution network very large. Together, these two factors make the clock distribution network consume a large portion of the total power. Power can be saved by optimizing the clock distribution network. In real sequential circuits, the inputs of sequential logic do not toggle in every cycle. Sequential logic wastes energy when the input does not toggle and the clock signal still charges and discharges the load of the clock distribution network. Only sequential components need clock signals, and in sequential circuits, the most used devices are flip-flop circuits. Flip-flops are thought to be one of the most energy-consuming components of digital circuits. Several power management techniques have been proposed to reduce power dissipation by eliminating the unnecessary transitions of various signals in the circuits. These techniques generally manage the idleness and the shutdown of parts of the circuits to reduce power dissipation. Among those methods, the clock gating (CG) technique is the most well-known and common technique used for dynamic power reduction. CG has been studied for a long time, and a number of methods have been proposed to improve the efficiency of CG. Few conventional CG techniques take activity correlation into account, even though activity correlation plays a very important role in determining the efficiency of CG.

CG is not simply gating as many sequential devices as possible. There is a tradeoff between the power reduction by CG and extra power consumed by the additional gates and latches for CG. There is a continuing need for improved power saving and/or clock gating techniques for integrated circuits.

SUMMARY OF THE INVENTION

A general object of the invention is to provide a method, and software for automatically implementing the method, for correlating activity between flip-flops for clock gating, to reduce the dynamic power consumption of very large scale integrated (VLSI) circuits. A heuristic method and algorithm is proposed to find a sub-optimal clock gating scheme, which obtains more power reduction compared to existing techniques.

The general object of the invention can be attained, at least in part, through a method for improving power consumption in integrated circuits by grouping circuits, such as flip-flop circuits, by activity correlation, such as clock toggles, and clock gating as a function of the grouped circuits. Embodiments of the method incorporate grouping the circuits in an activity correlation matrix, with such correlation desirably being performed during a predetermined number of clock cycles.

By using the activity correlation matrix of this invention, the method and corresponding software can find groups of circuits that are correlated closely and gate them together. By considering activity correlation in the clock gating technique, power consumption can be reduced. In some embodiments of this invention, the method includes: grouping the circuits in an activity correlation matrix; sorting the circuits from the activity correlation matrix in ascending order as a function of a toggle rate; clustering the circuits having a highest correlation in a group; continuing the addition of circuits having the next highest correlation to the group until a power gain is no longer increasing and/or is above a predetermined threshold; and gating the circuits not within the group.

The invention further includes a method for improving power consumption in integrated circuits by: correlating flip-flop circuits as a function of circuit activity; classifying the correlated circuits into a plurality of clusters; and gating at least one of the clusters including lower activity flip-flop circuits. Embodiments of the invention further include determining a number of clusters to gate as a function of power savings, wherein the power savings is determined as a function of power reduction by the gating and power used for the correlating and classifying steps. In some embodiments, the correlation is based upon a predetermined input vector timeframe.

Some embodiments according to this invention can be used for power optimization at the gate level of a VLSI/ASIC design flow, for example, for the purpose of reducing dynamic power. More broadly, the clustering technique can also be used for power gating, which can be applied to groups of logic gates that are more correlated, to reduce leakage power consumption.

In some embodiments according to this invention, the algorithm used for clustering is based on heuristics, which can only obtain a sub-optimal solution for clustering. Also, the algorithm may suffer from local optimization problems. Heuristics are widely used to solve NP-hard problems in computer science. The sub-optimal and local optimization problems can be traded off against time and accuracy.

The method and software/system of this invention are desirably automatically executed or implemented on and/or through a computing platform. Such computing platforms generally include one or more processors for executing the method steps stored as coded software instructions, at least one recordable medium for storing the software and/or matrix or other data produced by the method, an input/output (I/O) device, and a network interface capable of connecting either directly or indirectly to the Internet or other network.

Other objects and advantages will be apparent to those skilled in the art from the following detailed description taken in conjunction with the appended claims and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1 and 2 illustrate correlation based clock gating, according to one embodiment of this invention.

FIG. 3 shows a gated flip-flop of a bus-specific clock gating structure, according to one embodiment of this invention.

FIG. 4 illustrates pseudo code for a clustering algorithm according to one embodiment of this invention.

FIG. 5 includes plots summarizing payoff function result versus power measurement.

FIG. 6 is a plot comparing the performances of the three clock gating techniques.

FIG. 7 summarizes a comparison between OBSC and CBSC, from the examples below, using a Synopsys Power Compiler.

DESCRIPTION OF THE INVENTION

The present invention provides a clustering bus-specific clock (CBSC) gating technique, which produces a better performance on power reduction. From a mathematical perspective, CBSC gating removes the constraint on the number of groups, and obtains a better solution for the clock gating optimization problem. The method exploits the activity correlations between flip-flops, and classifies them into several clusters. In addition, the method uses a different training input vector and test input vector. To exploit the correlations between flip-flops, embodiments of this invention incorporate an activity correlation matrix. Some embodiments of this invention determine a payoff function, which is more efficient, to find an optimal classification scheme.

FIGS. 1 and 2 illustrate correlation based clock gating. In FIG. 1, there are three flip-flops (FFs): FF1, FF2 and FF3. FF1 and FF3 have the same number of toggles, and FF2 has two more toggles. Within the same number of clock cycles (14 cycles in FIG. 1), the toggle rates (TR) of FF1, FF2 and FF3, which are denoted as TR(1), TR(2) and TR(3), are 4/14, 6/14 and 4/14 respectively. If two FFs have to be gated, the known techniques, which take only the TR into consideration, will choose FF1 and FF3 as the gated group. However, it is seen from FIG. 2 that more wasted clock toggles are saved when FF1 and FF2 are gated together.

From FIG. 1, FF1 and FF2 are more correlated according to embodiments of this invention, since they are more likely to toggle together. FIG. 2 shows the clock signal when different groups are gated, and it can be seen that when FF1 and FF2 are gated together, there will be fewer clock toggles, which means less power consumption.

The method of this invention exploits activity correlations between the flip-flops. In some embodiments of this invention the activity correlation is based on the assumption that some flip-flops in a specific design might have certain relations which make them tend to toggle together. In some embodiments, the basic concept of activity correlation is defined as: given a certain input vector, during a period in which the input vector is effective, the toggle number relations between devices (e.g., flip-flops, used herein for description). The toggle numbers of each flip-flop are counted during the period when a certain input vector is in effect, and these toggle numbers to some extent reflect the response of the flip-flop to that input vector. If two flip-flops have the same or similar toggle patterns, then they are considered related by activity; conversely, if the two flip-flops have a very large toggle number difference, then they are considered unrelated by activity. In the activity correlation matrix building process, the toggles of each flip-flop are counted over a certain time period, for example the 2 clock cycles illustrated below in Table 1.

TABLE 1

Period   Clock ticks   FF1 toggle counts   FF2 toggle counts   FF3 toggle counts
1        1-2           1                   1                   0
2        3-4           1                   1                   0
3        5-6           1                   1                   0
4        7-8           1                   1                   1
5        9-10          0                   0                   1
6        11-12         0                   2                   0
7        13-14         0                   0                   2

The next step of building the activity correlation matrix calculates the correlation. Using FF1 as an example, the correlation between FF1 and each of FF2 and FF3 is calculated. First, calculate the correlation between FF1 and FF2 by subtracting the toggle count of FF2 from that of FF1 in each period, and summing the absolute values of the differences: Sum_dif(FF1,FF2)=|1-1|+|1-1|+|1-1|+|1-1|+|0-0|+|0-2|+|0-0|=2. Likewise, Sum_dif(FF1,FF3)=|1-0|+|1-0|+|1-0|+|1-1|+|0-1|+|0-0|+|0-2|=6. Table 2 summarizes the results for the correlation of FF1, FF2, and FF3.

TABLE 2

       FF1   FF2   FF3
FF1    0     2     6
FF2    2     0     8
FF3    6     8     0

Next, the results are normalized against the largest entry of Table 2 (max_sum_dif=8) by:

cor(FF1,FF2)=(max_sum_dif−Sum_dif(FF1,FF2))/max_sum_dif=(8−2)/8=¾

The resulting activity correlation matrix is shown below. From the activity correlation matrix it is clear that FF1 and FF2 are more correlated than FF1 and FF3.

       FF1   FF2   FF3
FF1    1     ¾     ¼
FF2    ¾     1     0
FF3    ¼     0     1
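The worked example above can be reproduced with a short Python sketch. The per-period toggle counts are copied from Table 1; the function names `sum_dif` and `correlation_matrix` are illustrative names chosen here, not part of the described method, and the guard for an all-zero difference table is an added assumption.

```python
# Toggle counts per 2-cycle period, copied from Table 1.
toggles = {
    "FF1": [1, 1, 1, 1, 0, 0, 0],
    "FF2": [1, 1, 1, 1, 0, 2, 0],
    "FF3": [0, 0, 0, 1, 1, 0, 2],
}


def sum_dif(a, b):
    """Sum of absolute per-period toggle-count differences (Table 2 entries)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_matrix(counts):
    """Normalize the Sum_dif table so identical patterns score 1."""
    names = list(counts)
    dif = {(p, q): sum_dif(counts[p], counts[q]) for p in names for q in names}
    max_dif = max(dif.values())
    if max_dif == 0:
        # Assumption: if every flip-flop toggles identically, treat all pairs
        # as fully correlated instead of dividing by zero.
        return {pair: 1.0 for pair in dif}
    return {pair: (max_dif - d) / max_dif for pair, d in dif.items()}


cor = correlation_matrix(toggles)
print(cor[("FF1", "FF2")], cor[("FF1", "FF3")], cor[("FF2", "FF3")])
```

With the Table 1 data, the three printed values correspond to the ¾, ¼ and 0 entries of the matrix above.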

In embodiments of this invention, the groups of the circuits in an activity correlation matrix are then sorted in ascending order as a function of a toggle rate. The method then clusters the circuits based upon correlation rate, with the highest correlation in one group. Circuits can be continually added to the cluster, on the basis of having the next highest correlation to the group, until a power gain is no longer increasing and/or is above a predetermined threshold. Any flip-flops not within the clustered group can be gated to save power. In some embodiments according to this invention, a procedure of the clustering algorithm is summarized as:

-   (1) Obtain the activity correlation matrix;
-   (2) Sort the flip-flops in ascending order based on their toggle rate, and put all of them in set A;
-   (3) Get a flip-flop FFx from set A, which has the least toggle rate. If A is empty, go to (8);
-   (4) Get the most correlated flip-flops of FFx from A based on the activity correlation matrix, and group them together, if A is empty, go to (8);
-   (5) Then calculate the payoff with a specific payoff function;
-   (6) If the payoff is greater than 0, make the flip-flops into the same group, and remove them from set A; then go to (4);
-   (7) If the payoff is greater than 0 and A is not empty, go to (3); and
-   (8) Return.
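A minimal Python sketch of one possible reading of this procedure follows. It is not the pseudo code of FIG. 4. Several points are assumptions made here for illustration: the payoff function is supplied by the caller, a candidate is accepted only while the payoff exceeds the threshold and keeps increasing, and step (7) is read as the branch taken when the candidate is rejected, at which point a new cluster is started from the next least-active flip-flop. The names `cluster_flip_flops` and `toy_payoff` are hypothetical.

```python
from itertools import combinations


def cluster_flip_flops(toggle_rate, cor, payoff, threshold=0.0):
    """Greedy clustering: seed with the least-active flip-flop, then keep
    adding the flip-flop most correlated with the seed while payoff rises."""
    remaining = sorted(toggle_rate, key=toggle_rate.get)   # step (2)
    clusters = []
    while remaining:                                        # step (3)
        seed = remaining.pop(0)
        group, best = [seed], threshold
        while remaining:                                    # step (4)
            cand = max(remaining, key=lambda ff: cor[seed][ff])
            gain = payoff(group + [cand])                   # step (5)
            if gain <= best:                                # assumed step (7)
                break
            group.append(cand)                              # step (6)
            remaining.remove(cand)
            best = gain
        clusters.append(group)
    return clusters


# Toy data: toggle rates and correlations from the FF1/FF2/FF3 example.
rates = {"FF1": 4 / 14, "FF2": 6 / 14, "FF3": 4 / 14}
cor = {
    "FF1": {"FF1": 1.0, "FF2": 0.75, "FF3": 0.25},
    "FF2": {"FF1": 0.75, "FF2": 1.0, "FF3": 0.0},
    "FF3": {"FF1": 0.25, "FF2": 0.0, "FF3": 1.0},
}


def toy_payoff(group):
    """Hypothetical stand-in payoff: pairwise correlation minus a fixed
    overhead of 0.5 per added flip-flop. Not the payoff of equation (15)."""
    pairs = sum(cor[a][b] for a, b in combinations(group, 2))
    return pairs - 0.5 * (len(group) - 1)


print(cluster_flip_flops(rates, cor, toy_payoff))
```

Under this toy payoff, FF1 and FF2 end up in one cluster and FF3 is left on its own. Whether a single-member cluster is actually gated is left to the caller in this sketch.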

The present invention is described in further detail in connection with the following examples which illustrate or simulate various aspects involved in the practice of the invention. It is to be understood that all changes that come within the spirit of the invention are desired to be protected and thus the invention is not to be construed as limited by these examples.

As described above, circuit activity information is used to build an activity correlation matrix. In this example, a value change dump (VCD) file was used, which supplied sufficient information on the activities of each cell in a design. To make the correlation information resemble the physical circuit closely enough, a certain number of random input vectors were used. The more input vectors used, the more accurate the correlation model. One consideration for sequential circuits is that there are usually memory elements (latches or flip-flops). To record the information of the memory elements, each input vector was held for several cycles in the training test bench. Note that the input was only held for several cycles in the training input vector (for generating the activity correlation matrix). In a real application, this constraint is not a concern, and the input vector can change every cycle if necessary.

In the training test bench, every randomly generated input vector was held for 10 cycles. Each period during which one input vector was held is called a duration. Supposing a total of M input vectors, there were a total of M durations to count. In every duration, the toggle numbers of each flip-flop output were counted, and Θ_(k)=[α₁, α₂, . . . , α_(N)] was used to denote the counting record for one duration, where Θ_(k) is defined as the toggle number vector, k denotes the k-th duration, α_(i) denotes the toggle numbers of the i-th flip-flop, and N denotes the number of flip-flops. An N×N activity correlation matrix Ψ is then defined. Each row of the activity correlation matrix is defined as

${{\Psi \left( {i,\text{:}} \right)} = {\sum\limits_{k = 0}^{M}{{\alpha_{i}^{k} - \Theta_{k}}}}};$

where α_(i)^(k) denotes the toggle numbers of the i-th flip-flop in the k-th duration, and M denotes the number of durations. After normalization, the activity correlation matrix can be obtained, which has the same properties as the correlation matrix in statistics: 1) it is a symmetric matrix; 2) the diagonal entries are all 1.
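As a rough illustration of how the toggle number vectors and the un-normalized matrix could be computed, the Python sketch below works from per-cycle sampled flip-flop outputs rather than from a VCD file. The functions `toggle_vectors` and `difference_matrix`, the sample waveforms, and the convention that a transition is assigned to the duration containing its later sample are all assumptions made for this sketch.

```python
def toggle_vectors(waves, hold):
    """waves[i] is the per-cycle output of flip-flop i (0/1 samples).
    Returns one toggle number vector per duration of `hold` cycles."""
    n_cycles = len(waves[0])
    thetas = []
    for k in range(n_cycles // hold):
        start, stop = k * hold, (k + 1) * hold
        theta = []
        for wave in waves:
            # A toggle is a change between consecutive samples; it is assigned
            # to the duration in which the later sample falls (assumption).
            theta.append(sum(1 for t in range(max(start, 1), stop)
                             if wave[t] != wave[t - 1]))
        thetas.append(theta)
    return thetas


def difference_matrix(thetas):
    """Entry (i, j) sums |alpha_i^k - alpha_j^k| over all durations k."""
    n = len(thetas[0])
    return [[sum(abs(th[i] - th[j]) for th in thetas) for j in range(n)]
            for i in range(n)]


# Three illustrative waveforms, each input vector held for 3 cycles.
waves = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
thetas = toggle_vectors(waves, hold=3)
psi = difference_matrix(thetas)
print(thetas)
print(psi)
```

The resulting matrix is symmetric with a zero diagonal; after normalization against its largest entry, identical toggle patterns score 1, in line with the two properties stated above.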

As the activity correlation matrix supplies the activity correlation information, the flip-flops can easily be classified. However, a payoff function should be defined to measure the performance of different classification schemes, and hence, to find an optimal classification. In embodiments of this invention, the payoff function is defined to consider the tradeoff of the power reduction by clock gating and extra power dissipated by the additional gates and latches for clock gating as discussed above.

L. Li et al., “Activity-driven optimized bus-specific-clock-gating for ultra-low-power smart space applications,” Journal of IET Communications, vol. 5, iss. 17, pp. 2501-2508 (2011), provides a power estimation model (referred to as “OBSC”), which was used to find an optimal classification scheme. Because the OBSC technique needs to iterate many times to find an optimal scheme, the efficiency of the power estimation model is critical. Unfortunately, the power estimation model is not efficient. With the increasing scale of the circuit, the computational complexity of the OBSC increases exponentially, and the power estimation model becomes so inefficient that it is impossible to get a result within an acceptable time. In some embodiments of this invention, the power consumption is not estimated, but instead a payoff function is built or determined, which can indicate the tradeoff between the power saved by clock gating and the additional power caused by the clock gating logic. This payoff function is simpler, more efficient, and, most importantly, sufficient to measure the power reduction of the different clustering schemes.

Clock gating techniques are mainly used to reduce the dynamic power of digital circuits. Generally, dynamic power can be categorized into two parts: 1) power dissipated by charging and discharging the load capacitance, herein named switching power; 2) power caused by short circuit current, herein named short circuit power. Switching power is given by:

P _(SW) =α·C _(L) ·V _(DD) ² ·f   (1)

where P_(SW) is the switching power, α is the activity factor (e.g., a toggle rate), C_(L) is the load capacitance, V_(DD) is the supply voltage, and f is the working frequency.

Unlike the switching power, the short circuit power varies with many factors. It is strongly sensitive to the ratio of the threshold voltage to supply voltage: V_(th)/V_(DD). It has also been observed to depend on the input ramp, the load capacitance and the transistor size. Because of these multiple dependencies, the short circuit power estimation models are usually complex. S. Turgis et al., “Explicit evaluation of short-circuit power dissipation for CMOS logic structures,” Proceedings of ISLPD, Dana Point Resort, April 1995, pp. 129-134, proposed a first order formulation for short circuit power dissipation. The main idea of this formulation is to use the parameter C_(SC), the short circuit capacitance, which has no physical meaning and is just an equivalent way to represent the charge transfer. With this ‘short circuit capacitance’, the short circuit power can be expressed in the same way as switching power:

P _(SC) =α·C _(SC) ·V _(DD) ² ·f   (2)

where P_(SC) is the short circuit power, and C_(SC) is the short circuit capacitance. With equations (1) and (2), the total dynamic power can be given as:

P _(dynamic)=α·(C _(L) +C _(SC))·V _(DD) ² ·f   (3)
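Equation (3) can be evaluated directly. The following Python sketch is only an illustration: the function name `dynamic_power` and all numeric values are assumed here and are not taken from any cell library or from the examples in this description.

```python
def dynamic_power(alpha, c_load, c_sc, vdd, freq):
    """Equation (3): P_dynamic = alpha * (C_L + C_SC) * V_DD^2 * f."""
    return alpha * (c_load + c_sc) * vdd ** 2 * freq


# Illustrative values only (assumed): 10 fF load, 2 fF short circuit
# capacitance, 1.0 V supply, 100 MHz clock.
ungated = dynamic_power(alpha=1.0, c_load=10e-15, c_sc=2e-15, vdd=1.0, freq=100e6)
gated = dynamic_power(alpha=0.25, c_load=10e-15, c_sc=2e-15, vdd=1.0, freq=100e6)
print(ungated, gated)
```

Because the expression is linear in the activity factor, lowering the effective toggle rate from 1 to 0.25 in this example lowers the modeled dynamic power by the same factor.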

The clock gating technique can save dynamic power by eliminating wasted clock toggles. However, the additional logic introduced by the clock gating consumes extra power. So in embodiments of this invention, a payoff function consists of two parts: 1) power saved by clock gating: P_(saved); and 2) extra power introduced by clock gating logic: P_(extra). The payoff function can be provided by equation (4):

F _(payoff) =P _(saved) −P _(extra)   (4)

FIG. 3 shows a gated flip-flop of a bus-specific clock gating structure. From the viewpoint of the flip-flop, there is only one XOR gate introduced to the load. The actions of inputs and outputs of flip-flops are not, and should not be, affected. The dynamic power is determined by the supply voltage, load capacitance and toggle rate (activity factor). The supply voltage is assumed to be constant, so the dynamic power can be analyzed from the other two factors. First, in a clock gated flip-flop, the wasted clock toggles are gated, which results in a reduced toggle rate. The dynamic power will decrease with the clock toggle rate. Secondly, the XOR gate increases the load capacitance of the flip-flop. With the same toggle rate at the output Q, the dynamic power will increase with the load capacitance.

The dynamic power of a flip-flop can be roughly modeled as a function of the toggle rates of each port. Based on equation (3), the dynamic power of the flip-flop can be expressed as:

P _(eff)=α·(C _(L) +C _(SC))·V _(DD) ² ·f   (5)

Flip-flops generally include four action states. The first state happens when the data input and clock signal are both toggling; the second state happens when the clock toggles and the data input does not; the third state happens when the data input toggles and the clock does not; and the last state happens when neither the data input nor the clock toggles. The effective toggle rate depends on both the data input and the clock input. So a function is defined to describe the effective toggle rate:

α=ψ_(i)(TR_(clk) ^(i),TR_(d) ^(i))   (6)

where α is the effective toggle rate; TR_(clk)^(i) is the toggle rate of the clock signal in state i; TR_(d)^(i) is the toggle rate of the data input in state i; and i ranges over (I, II, III, IV) and denotes the state.

Looking into the function and structure of clock gating, it can be seen that clock gating only has a major effect on state two, in which clock gating techniques try to eliminate the clock toggles. In this state, the toggle rate of the data input is 0. So the effective toggle rate in this state depends only on TR_(clk)^(II).

P _(II)=ψ_(II)(TR _(clk) ^(II) ,TR _(d) ^(II))·C _(tot) ·V _(DD) ² ·f

P _(II)=ψ′(TR _(clk) ^(II))C _(tot) ·V _(DD) ² ·f   (7)

Because the action in state II is monotonous (clock toggling, data input holding), the function ψ′(TR_(clk)^(II)) is linear. Suppose ψ′(TR_(clk)^(II))=k·TR_(clk)^(II), then:

P _(II) =TR _(clk) ^(II) ·k·C _(tot) ·V _(DD) ² ·f   (8)

As a result, P_(saved) is obtained:

$\begin{matrix} P_{saved}^{i} = P_{II}^{i} = TR_{clk}^{i\_II} \cdot k^{i} \cdot C_{tot}^{i} \cdot \left( V_{DD}^{i} \right)^{2} \cdot f \\ P_{saved} = \sum\limits_{i=1}^{N} P_{II}^{i} = \sum\limits_{i=1}^{N} TR_{clk}^{i\_II} \cdot P_{unit}^{i} \end{matrix} \qquad (9)$

where P_(unit) is a parameter depending on the cell library, and it can be extracted from the library.

There is a minor effect on the power of the other states, which is caused by the introduced load capacitance of the XOR gate. This effect can be compensated in the calculation of P_(extra), which is the extra power introduced by clock gating logic.

Assuming N flip-flops are to be gated together, in BSC style clock gating, the extra logic includes: N XOR gates, (N−1) OR gates (approximately), 1 latch and 1 AND gate. Note that with different synthesis tools, the logic cells used for clock gating and the number of cells might vary. For example, synthesis tools might use 3-input OR gates or 2-input OR gates. In the payoff function according to one embodiment of this invention, it is assumed that the OR gates are all 2-input cells. However, this assumption does not affect the accuracy of the payoff function, because the payoff function only needs to estimate the trend of the power variation rather than the precise power consumption. The power variation trend is sufficient for the clustering algorithm.

In the bus-specific clock gating structure, each gated flip-flop needs an extra XOR gate to detect the states of the input and output of the flip-flop. Because of the delay, the toggle rate of the XOR gate is twice that of the corresponding flip-flop output.

$\begin{matrix} P_{extra\_XOR}^{i} = 2 \cdot TR_{FF\_Q}^{i} \cdot P_{unit\_XOR}^{i} \\ P_{extra\_XOR} = 2 \cdot \sum\limits_{i=1}^{N} TR_{FF\_Q}^{i} \cdot P_{unit\_XOR}^{i} \end{matrix} \qquad (10)$

The number of OR gates needed for clock gating depends on the synthesis tool, but one can generally use (N−1) 2-input OR gates to estimate the power variation trend. The toggle rate of each OR gate is affected by the combination of inputs. To leave a margin for the payoff function, the maximum of the two input toggle rates can be used as the output toggle rate. Note that the OR gates constitute a tiny part of the extra power, so a rough estimate is enough. The extra power of the OR gates is:

$P_{extra\_OR} = \sum\limits_{i=1}^{N-1} \mathrm{Max}\left( TR_{FF\_Q}^{i}, TR_{FF\_Q}^{i+1} \right) \cdot P_{unit\_XOR}^{i} \qquad (11)$

In a bus-specific clock gating structure, there will be only one latch. Like the flip-flop, the latch has multiple operation states. However, unlike the flip-flop, for which only one state needs to be considered, all operation states of the latch should be considered. Since the input of the latch is a constant clock signal and the enable signal has two states, the latch has two operation states: 1) enabled; and 2) disabled:

P_(extra_latch) = P_(state_I) + TR_(tot)·P_(unit_latch)   (12)

where TR_(tot) is the total toggle rate after a group of flip-flops is gated together.

In the bus-specific clock gating structure, there is only one AND gate. However, it is the largest part of the extra power consumption, because the AND gate has a very large fan-out of N flip-flops. Its toggle rate is TR_(tot). The power model is given as:

P_(extra_AND) = TR_(tot)·P_(unit_AND) + N·TR_(tot)·P_(FF_load)   (13)

As discussed above, the extra power introduced by the XOR gate to the flip-flops is to be considered:

$P_{extra\_com} = \sum\limits_{i=1}^{N} N \cdot TR_{FF\_Q}^{i} \cdot P_{XOR\_load} \qquad (14)$

Substituting equations (9)-(14) into equation (4) provides the final form of the payoff function:

$F_{payoff} = \sum\limits_{i=1}^{N} TR_{clk}^{i\_II} \cdot P_{unit}^{i} - \left( 2 \cdot \sum\limits_{i=1}^{N} TR_{FF\_Q}^{i} \cdot P_{unit\_XOR}^{i} + \sum\limits_{i=1}^{N-1} \mathrm{Max}\left( TR_{FF\_Q}^{i}, TR_{FF\_Q}^{i+1} \right) \cdot P_{unit\_XOR}^{i} + P_{state\_I} + TR_{tot} \cdot P_{unit\_latch} + TR_{tot} \cdot P_{unit\_AND} + N \cdot TR_{tot} \cdot P_{FF\_load} + \sum\limits_{i=1}^{N} N \cdot TR_{FF\_Q}^{i} \cdot P_{XOR\_load} \right) \qquad (15)$
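The payoff function can be sketched in Python by evaluating equations (9) through (14) term by term and combining them with equation (4). This is an illustrative sketch under stated assumptions: the `Lib` container and its field names are hypothetical, a single library value is used for all flip-flops rather than per-cell values, the OR-gate term follows equation (11) as printed and therefore reuses the XOR unit power, and the numbers in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass
class Lib:
    """Hypothetical container for cell-library parameters (assumed names)."""
    p_unit: float        # P_unit, equation (9)
    p_unit_xor: float    # P_unit_XOR, equations (10) and (11)
    p_unit_latch: float  # P_unit_latch, equation (12)
    p_unit_and: float    # P_unit_AND, equation (13)
    p_ff_load: float     # P_FF_load, equation (13)
    p_xor_load: float    # P_XOR_load, equation (14)
    p_state_i: float     # P_state_I, equation (12)


def payoff(tr_clk_ii, tr_q, tr_tot, lib):
    """F_payoff = P_saved - P_extra for one candidate group of flip-flops.

    tr_clk_ii[i]: clock toggle rate of flip-flop i in state II
    tr_q[i]:      toggle rate at output Q of flip-flop i
    tr_tot:       total toggle rate of the gated group
    """
    n = len(tr_q)
    p_saved = sum(t * lib.p_unit for t in tr_clk_ii)                 # (9)
    p_xor = 2 * sum(t * lib.p_unit_xor for t in tr_q)                # (10)
    p_or = sum(max(tr_q[i], tr_q[i + 1]) * lib.p_unit_xor
               for i in range(n - 1))                                # (11)
    p_latch = lib.p_state_i + tr_tot * lib.p_unit_latch              # (12)
    p_and = tr_tot * lib.p_unit_and + n * tr_tot * lib.p_ff_load     # (13)
    p_com = sum(n * t * lib.p_xor_load for t in tr_q)                # (14)
    return p_saved - (p_xor + p_or + p_latch + p_and + p_com)        # (4)


# Made-up parameters, for illustration only.
lib = Lib(p_unit=10.0, p_unit_xor=1.0, p_unit_latch=1.0, p_unit_and=1.0,
          p_ff_load=1.0, p_xor_load=1.0, p_state_i=0.0)
value = payoff(tr_clk_ii=[0.5, 0.5], tr_q=[0.1, 0.2], tr_tot=0.3, lib=lib)
print(value)
```

A positive result indicates that, under the model, gating the group saves more power than the added gating logic consumes; only the sign and trend matter for the clustering step, not the absolute value.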

After obtaining the activity correlation matrix and payoff function, the flip-flops can be classified. The clustering algorithm in CBSC lists the flip-flops in ascending order of their toggle rate and, for each flip-flop, searches the activity correlation matrix for the most correlated flip-flops and groups them together as a cluster. Then the payoff function is used to obtain the power gain. If the power gain is larger than a threshold and is increasing, then the method continues to add the most correlated of the remaining flip-flops to the group. The steps are repeated, looking for correlated flip-flops, until the payoff function stops increasing. FIG. 4 illustrates pseudo code for the clustering algorithm according to one embodiment of this invention.

The proposed payoff function, according to different embodiments of this invention, was tested on part of the ISCAS'89 benchmark circuits (s298, s9234, and s38417) to verify its validity. The proposed CBSC was tested on all ISCAS'89 benchmark circuits, and compared with the OBSC technique described above, as well as an automatic clock gating (ACG) technique used in Synopsys Power Compiler.

The payoff function is one of the key parts of the CBSC. It is used to measure the performance of each classification scheme, and to find an optimal clustering scheme. So, the first step of the experiment was to verify the validity of the payoff function. The subject circuits had 14, 211, and 1636 flip-flops respectively. They represented the whole benchmark, because their flip-flop counts fall in the small group (flip-flop numbers ranging from 3 to 29), the moderate group (flip-flop numbers ranging from 32 to 211) and the large group (flip-flop numbers ranging from 534 to 1728) of the whole set of benchmarks.

In the verification of the payoff function, the flip-flops were all sorted in ascending order, and then the flip-flops were added into the gated group one by one. In each step the power consumption was recorded. FIG. 5 shows the verification results, with the dynamic power curve showing the power measurement of each step. Meanwhile, the payoff function was used to predict the power gain of each step. In FIG. 5, the payoff curve shows the power gain of each clock gating step. The results are all normalized to 1.

From FIG. 5, it is seen that the payoff function can predict the power changing trend. In all of parts (a), (b) and (c) of FIG. 5, the lowest power consumption occurs where the highest power gain in the payoff function curve occurs. Also, the power measurement curve rises when the payoff function curve drops; and the power measurement curve drops when the payoff function curve rises. The payoff function provides the ability to measure the performance of different clock gating schemes.

With the payoff function, the CBSC algorithm was implemented to cluster all the flip-flops in each benchmark circuit for dynamic power optimization. Table 3 shows the clustering results of the CBSC algorithm, as well as the OBSC gating scheme. The OBSC gating scheme can be considered as a specific case of the CBSC, which has only one cluster. In the CBSC technique, a variable number of clusters is formed for each benchmark circuit based on the power reduction effect.

TABLE 3 Comparison of gated flip-flops

Benchmark   FF No.   OBSC gated FF No.   OBSC cluster No.   CBSC gated FF No.   CBSC cluster No.
S27         3        1                   1                  1                   1
S298        14       9                   1                  9                   4
S344        15       5                   1                  5                   2
S349        15       5                   1                  5                   2
S382        21       16                  1                  16                  3
S386        6        4                   1                  4                   2
S400        21       16                  1                  16                  3
S420        16       13                  1                  14                  5
S444        21       14                  1                  16                  3
S510        6        4                   1                  2                   1
S526        21       14                  1                  16                  4
S526n       21       14                  1                  16                  4
S641        19       13                  1                  13                  4
S713        19       12                  1                  13                  4
S820        5        3                   1                  3                   1
S832        5        3                   1                  3                   1
S838        32       28                  1                  30                  5
S953        29       23                  1                  21                  7
S1196       18       5                   1                  6                   3
S1238       18       5                   1                  6                   3
S1423       74       38                  1                  49                  11
S1488       6        3                   1                  5                   2
S1494       6        3                   1                  5                   2
S5378       179      95                  1                  100                 15
S9234       211      150                 1                  168                 12
S13207      638      436                 1                  481                 28
S15850      534      288                 1                  424                 51
S35932      1728     951                 1                  1317                114
S38417      1636     999                 1                  1295                50
S38584      1426     148                 1                  636                 107

The power analysis tool used in the experiment was Synopsys Power Compiler. For comparison, four groups of power consumption data were measured, three of which were implemented with different clock gating techniques. In Table 4, the first column shows the benchmark circuit in ISCAS'89; the second column shows the flip-flop numbers of each circuit; the third column shows the dynamic power consumption of the original circuit without any power optimization scheme; the fourth column shows the power consumption of the benchmark circuits with the automatic clock gating technique in Synopsys Power Compiler; the fifth column shows the power consumption of the benchmark circuits with the OBSC gating scheme; and the last column shows the power consumption of the benchmark circuits with the invented CBSC gating scheme.

TABLE 4 Power measurements: dynamic power (uW)

Circuit   FF No.   Original   ACG        OBSC       CBSC
S27       3        28.877     29.413     27.504     27.504
S298      14       128.510    134.473    85.439     88.540
S344      15       168.784    161.185    152.516    152.780
S349      15       170.154    161.185    154.030    154.333
S382      21       161.364    154.160    81.600     77.547
S386      6        120.351    82.776     102.185    103.606
S400      21       161.907    156.229    82.151     78.093
S420      16       130.053    83.701     63.148     64.921
S444      21       162.987    152.635    85.112     79.745
S510      6        115.300    148.970    109.368    111.465
S526      21       184.348    155.413    106.470    103.311
S526n     21       183.734    154.646    105.920    102.607
S641      19       167.708    135.207    112.667    107.895
S713      19       169.978    135.208    116.302    115.248
S820      5        191.520    102.238    178.696    178.696
S832      5        199.308    110.255    185.329    185.329
S838      32       238.749    135.666    83.536     79.821
S953      29       275.718    281.259    218.652    196.430
S1196     18       383.054    350.295    368.210    364.992
S1238     18       404.398    350.553    389.612    386.379
S1423     74       663.965    629.267    478.240    442.739
S1488     6        252.270    201.765    240.630    238.060
S1494     6        254.934    212.080    242.486    240.650
S5378     179      1816.6     1563.5     1305.3     1276.2
S9234     211      1072.7     754.754    855.035    852.637
S13207    638      4769.7     2696.1     2783.6     2234.3
S15850    534      3920.8     2708.2     2516.2     1904.3
S35932    1728     18137.7    14623.5    14196.6    12517.6
S38417    1636     12473.0    7403.8     6692.2     5411.0
S38584    1426     15280.2    13386.0    14636.3    13524.3

FIG. 6 includes data from Table 4, and compares the performances of the three clock gating techniques. In FIG. 6, the X-axis denotes the benchmark indices sorted by flip-flop number; the Y-axis denotes the absolute power reduction. The curve marked by rectangles denotes the power reduction by the ACG technique; the curve marked by cross signs denotes the power reduction by the OBSC gating technique; and the curve marked by dots denotes the power reduction by the proposed CBSC technique. FIG. 6(a) shows the direct curves, and FIG. 6(b) shows the curves resulting from a 4th-order polynomial data fitting algorithm.

FIG. 6 shows that when the flip-flop number is small, the difference between the three techniques is tiny. However, as the flip-flop number increases, the CBSC technique saves much more power than the other two, and the advantage grows with the flip-flop number.

FIG. 7 summarizes a comparison between OBSC and CBSC using the Synopsys Power Compiler. FIG. 7(a) shows the power reduction of CBSC versus OBSC. The X-axis denotes the benchmark circuit index in ascending order of flip-flop number; the Y-axis denotes the power reduction of CBSC versus OBSC. FIG. 7(a) shows that as the flip-flop number increased, the CBSC saved more and more power than OBSC. In FIG. 7(b), the advantage of CBSC over OBSC is compared in percentage scale, and it shows that, in the benchmark circuits with few flip-flops, OBSC was a little better than CBSC. However, that advantage was limited to 5%. As the flip-flop number increased, the CBSC reduced as much as 24.31% power on the basis of the OBSC. This is a reasonable result, because in circuits with few flip-flops, the CBSC may not have many clustering options, and a small number of flip-flops usually means the circuit scale is also small and the flip-flops are less correlated.
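The percentage comparison of FIG. 7(b) can be illustrated with a short calculation. The following is a minimal sketch, assuming that "power reduction of CBSC on the basis of OBSC" means (OBSC - CBSC) / OBSC; the function name `relative_reduction` and the choice of three rows from Table 4 are illustrative only.

```python
# Three rows copied from Table 4: (circuit, flip-flop count, OBSC uW, CBSC uW).
ROWS = [
    ("S298", 14, 85.439, 88.540),
    ("S1423", 74, 478.240, 442.739),
    ("S15850", 534, 2516.2, 1904.3),
]


def relative_reduction(obsc_uw, cbsc_uw):
    """Percentage of the OBSC power that CBSC saves (assumed definition).

    A negative value means OBSC consumed less power than CBSC.
    """
    return 100.0 * (obsc_uw - cbsc_uw) / obsc_uw


for name, ff, obsc, cbsc in ROWS:
    absolute = obsc - cbsc  # absolute saving in uW, as plotted in FIG. 7(a)
    relative = relative_reduction(obsc, cbsc)
    print(f"{name:7s} FF={ff:4d}  abs={absolute:8.3f} uW  rel={relative:6.2f} %")
```

Under this assumed definition, the S298 row yields a small negative value (OBSC slightly better), while the S15850 row yields roughly 24.3%, in line with the largest reduction reported above.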

FIG. 7(c) shows the relation between the advantage of CBSC over OBSC and the flip-flop number. The X-axis denotes the flip-flop number in logarithmic scale; the Y-axis denotes the power reduction of CBSC on the basis of OBSC. The curve is obtained by a 4th-order polynomial data fitting algorithm. FIG. 7(c) shows that the power reduction of CBSC over OBSC increased exponentially with the flip-flop number plotted in logarithmic scale. In FIG. 7(d), the X-axis denotes the flip-flop number in logarithmic scale, and the Y-axis denotes the power reduction of CBSC on the basis of OBSC, in percentage scale. The curve is fitted by a 4th-order polynomial data fitting algorithm. FIG. 7(d) shows that the curve stayed above zero and swung around 15%; in conclusion, the CBSC was overall better than the OBSC.
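A fit of the kind used for FIG. 7(c) can be sketched as follows. This is a minimal illustration, assuming an ordinary least-squares fit (the fitting algorithm is not named in the text) of a 4th-order polynomial in log10 of the flip-flop number. The helper names `polyfit_least_squares` and `evaluate`, and the use of a twelve-row subset of Table 4, are assumptions made for the example.

```python
import math

# Subset of Table 4: (flip-flop count, OBSC power in uW, CBSC power in uW).
ROWS = [
    (3, 27.504, 27.504), (6, 102.185, 103.606), (14, 85.439, 88.540),
    (21, 81.600, 77.547), (32, 83.536, 79.821), (74, 478.240, 442.739),
    (179, 1305.3, 1276.2), (211, 855.035, 852.637), (534, 2516.2, 1904.3),
    (638, 2783.6, 2234.3), (1426, 14636.3, 13524.3), (1728, 14196.6, 12517.6),
]


def polyfit_least_squares(xs, ys, degree):
    """Return coefficients c[0..degree] minimizing sum (y - sum c[k] x^k)^2."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y, where A[i][k] = xs[i] ** k.
    aug = [
        [sum(x ** (r + c) for x in xs) for c in range(n)]
        + [sum(y * x ** r for x, y in zip(xs, ys))]
        for r in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


xs = [math.log10(ff) for ff, _, _ in ROWS]     # logarithmic X-axis
ys = [obsc - cbsc for _, obsc, cbsc in ROWS]   # absolute saving of CBSC in uW
coeffs = polyfit_least_squares(xs, ys, 4)
print("fitted saving at 1000 flip-flops (uW):", evaluate(coeffs, math.log10(1000)))
```

Because only a subset of the benchmark rows is used here, the resulting curve is not expected to reproduce FIG. 7(c) exactly.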

Thus, the invention provides a new data structure, namely an activity correlation matrix, to denote the activity correlation of flip-flops in a circuit. Because the matrix is built from activity correlation, it is suitable for dynamic power optimization. The invention further provides a payoff function to measure the performance of a clock gating scheme. The proposed payoff function is efficient in time and can measure dynamic power performance. The activity correlation matrix and payoff function together provide for the clustering bus-specific clock gating method of this invention.
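By way of illustration only, an activity correlation matrix of the kind summarized above might be computed as follows. This is a hedged sketch: it assumes the correlation between two flip-flops is taken as one minus the absolute difference of their toggle counts over the observed cycles, normalized by the largest such difference over all pairs. That normalization, the function names `toggle_counts` and `activity_correlation_matrix`, and the sample traces are assumptions for the example, not a definitive statement of the method.

```python
def toggle_counts(traces):
    """Count output transitions of each flip-flop over the observed cycles."""
    return [sum(a != b for a, b in zip(t, t[1:])) for t in traces]


def activity_correlation_matrix(traces):
    """Return an n x n matrix with entries in [0, 1]; 1 means identical activity.

    Assumed definition: 1 - |t_i - t_j| / max over all pairs of |t_k - t_l|.
    """
    counts = toggle_counts(traces)
    n = len(counts)
    diffs = [[abs(counts[i] - counts[j]) for j in range(n)] for i in range(n)]
    largest = max(max(row) for row in diffs)
    if largest == 0:
        # All flip-flops toggled equally often: treat them as fully correlated.
        return [[1.0] * n for _ in range(n)]
    return [[1.0 - d / largest for d in row] for row in diffs]


# Hypothetical output traces of three flip-flops over eight clock cycles.
TRACES = [
    [0, 1, 0, 1, 0, 1, 0, 1],  # 7 toggles
    [0, 1, 0, 1, 0, 1, 1, 1],  # 5 toggles
    [0, 0, 0, 0, 1, 1, 1, 1],  # 1 toggle
]

for row in activity_correlation_matrix(TRACES):
    print(["%.3f" % value for value in row])
```

In this sketch the first two flip-flops, whose toggle counts are close, receive a higher correlation value than either does with the third, which is the kind of information a clustering step would use to group flip-flops.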

The invention illustratively disclosed herein suitably may be practiced in the absence of any element, part, step, component, or ingredient which is not specifically disclosed herein.

While in the foregoing detailed description this invention has been described in relation to certain preferred embodiments thereof, and many details have been set forth for purposes of illustration, it will be apparent to those skilled in the art that the invention is susceptible to additional embodiments and that certain of the details described herein can be varied considerably without departing from the basic principles of the invention.

What is claimed is:
 1. A method for improving power consumption in integrated circuits, the method comprising grouping circuits by activity correlation and clock gating as a function of the grouped circuits.
 2. The method of claim 1, wherein the circuits comprise flip-flop circuits.
 3. The method of claim 2, further comprising reducing clock toggles as a function of the grouped circuits.
 4. The method of claim 1, wherein the grouping comprises correlating the circuits as a function of circuit toggle.
 5. The method of claim 4, further comprising grouping the circuits in an activity correlation matrix.
 6. The method of claim 1, further comprising correlating activity of circuits during a predetermined number of clock cycles.
 7. The method of claim 6, further comprising determining a correlation between a first circuit and a second circuit during the predetermined number of clock cycles as a function of an absolute value of a toggle count difference between the first circuit and the second circuit.
 8. The method of claim 7, further comprising normalizing the absolute value with a plurality of absolute values determined between pairs of a plurality of circuits during the predetermined number of clock cycles.
 9. The method of claim 1, further comprising: grouping the circuits in an activity correlation matrix; sorting the circuits from the activity correlation matrix in ascending order as a function of a toggle rate; clustering the circuits having a highest correlation in a group; continue adding the circuits having the next highest correlation to the group until a power gain is no longer increasing and/or is above a predetermined threshold; and gating the circuits not within the group.
 10. A method for improving power consumption in integrated circuits, the method comprising: correlating flip-flop circuits as a function of circuit activity; classifying the correlated circuits into a plurality of clusters; and gating at least one of the clusters including lower activity flip-flop circuits.
 11. The method of claim 10, further comprising determining a number of clusters to gate as a function of power savings, wherein the power savings is determined as a function of power reduction by the gating and power used for the correlating and classifying steps.
 12. The method of claim 10, further comprising correlating flip-flop circuits for a predetermined input vector timeframe.
 13. The method of claim 10, further comprising reducing clock toggles as a function of the clustered flip-flop circuits and/or the gating.
 14. The method of claim 10, further comprising correlating the flip-flop circuits as a function of circuit toggle.
 15. The method of claim 10, further comprising grouping the flip-flop circuits in an activity correlation matrix.
 16. The method of claim 10, further comprising correlating activity of the flip-flop circuits during a predetermined number of clock cycles.
 17. The method of claim 16, further comprising determining a correlation between a first flip-flop circuit and a second flip-flop circuit during the predetermined number of clock cycles as a function of an absolute value of a toggle count difference between the first flip-flop circuit and the second flip-flop circuit.
 18. The method of claim 17, further comprising normalizing the absolute value with a plurality of absolute values determined between pairs of a plurality of flip-flop circuits during the predetermined number of clock cycles.
 19. The method of claim 10, further comprising: grouping the flip-flop circuits in an activity correlation matrix; sorting the flip-flop circuits from the activity correlation matrix in ascending order as a function of a toggle rate; clustering the flip-flop circuits having a highest correlation in a group; continue adding the flip-flop circuits having the next highest correlation to the group until a power gain is no longer increasing and/or is above a predetermined threshold; and gating the flip-flop circuits not within the group.